(57) Abstract

Documents stored in a database are searched for relevance to contextual information, instead of (or in addition to) similar text. 
Each stored document is indexed in terms of meta-information specifying contextual information about the document. Current contextual 
information is acquired, either from the user or the current computational or physical environment, and this "meta-information" is used as 
the basis for identifying stored documents of possible relevance. 

METHOD AND APPARATUS FOR AUTOMATED,
CONTEXT-DEPENDENT RETRIEVAL OF 
INFORMATION 

RELATED APPLICATION 

This application is based upon and claims priority from U.S. provisional applica- 
tion serial no. 60/062,111, filed October 14, 1997. 

BACKGROUND OF THE INVENTION 
The tremendous amounts of information now available even to casual computer
users, particularly over large computer networks such as the Internet, have engendered 
numerous efforts to ease the burden of locating, filtering, and organizing such informa- 
tion. These include classification and prioritization systems for e-mail (see, e.g., Maes, 
Commun. of ACM 37(7):30-40 (1994); Cohen, "Learning Rules that Classify E-mail," 
AAAI Spring Symposium on Machine Learning in Information Access, March 1996), 
systems for filtering news downloaded from the Internet (see, e.g., Lang, "NewsWeeder:
Learning to Filter Netnews," Machine Learning: Proc. of 12th Int'l Conf. (1995)), and
schemes for organizing user-specific information such as notes, files, diaries, and calendars
(see, e.g., Jones, Int'l J. of Man-Machine Studies 25 at 191-228 (1986); Lamming
et al., "Forget-me-not: Intimate Computing in Support of Human Memory," Proc.
FRIEND21, '94 Int'l Symp. on Next Generation Human Interface (1994)).

Systems designed for information retrieval generally function in response to ex- 
plicit user-provided queries. They do not, however, assist the user in formulating a 
query, nor can they assist users unable or unwilling to pose them. The Remembrance 
Agent ("RA"), described in Rhodes et al., Proc. of 1st Int'l Conf. on Practical Application
of Intelligent Agents and Multi-Agent Technology at 487-495 (1996), is a computer
program that watches what a user is typing in a word processor (specifically the Emacs 
UNIX-based text editor) and continuously displays a list of documents that might be
relevant to the document currently being written or read. For example, if a journalist is
writing a newspaper article about a presidential campaign, the RA might suggest notes
from a recent interview, an earlier article about the campaign, and a piece of e-mail from
her editor suggesting revisions to a previous draft of the article. 

The utility of the RA stems from the fact that currently available desktop com- 
puters are fast and powerful, so that most processing time is spent waiting for the user 
to hit the next keystroke, read the next page, or load the next packet off the network. 
The RA utilizes otherwise-wasted CPU cycles to perform continuous searches for in- 
formation of possible interest to the user based on current context, providing a continu- 
ous, associative form of recall. Rather than distracting from the user's primary task, the 
RA serves to augment or enhance it. 

The RA works in two stages. First, the user's collection of text documents is 
indexed into a database saved in a vector format. These form the reservoir of docu- 
ments from which later suggestions of relevance are drawn; that is, stored documents 
will later be "suggested" as being relevant to a document currently being edited or read. 
The stored documents can be any sort of text document (notes, Usenet entries, 
webpages, e-mail, etc.). This indexing is usually performed automatically every night, 
and the index files are stored in a database. After the database is created, the other stage 
of the RA is run from Emacs, periodically taking a sample of text from the working 
buffer. The RA finds documents "similar" to the current sample according to word
similarities; that is, the more times a word in the current sample is duplicated in a candidate
database document, the greater will be the assumed relevance of that database
document. The RA displays one-line summaries of the best few documents at the bottom
of the Emacs window. These summary lines contain a line number, a relevance
ranking (from 0.0 = not relevant to 1.0 = extremely relevant), and header information to
identify the document. The list is updated at a rate selectable by the user (generally
every few seconds), and the system is configured such that the entirety of a suggested
document can be brought up by the user pressing the "Control-C" key combination and
the line number to display.

Briefly, the concept behind the indexing scheme used in RA is that any given
document may be represented by a multidimensional vector, each dimension or entry of 
which corresponds to a single word and is equal in magnitude to the number of times 
that word appears in the document. The number of dimensions is equal to the number 
of allowed or indexed words. The advantages gained by this representation are rela- 
tively speedy disk retrieval, and an easily computed quantity indicating similarity be- 
tween two documents: the dot product of their (normalized) vectors. 

The RA creates vectors in three steps: 

1. Removal of common words (called stop words), identified in a list of stop words.

2. Stemming of words (changing "jumped" and "jumps" to "jump," for example).
This is preferably accomplished using the Porter stemming algorithm, a standard
method in the text-retrieval field.

3. Vectorization of the remaining text into a "document vector" (or "docvec").
Conceptually, a docvec is a multidimensional vector each entry of which indicates the
number of times each word appears in the document.

For example, suppose a document contains only the words: "These remembrance 
agents are good agents." 

Step 1: Remove stop words
This converts the text to "Remembrance agents good agents"

Step 2: Stem words
This converts the text to "remembr agent good agent"

Step 3: Make the document vector
This produces the vector:
0 0 0 ... 1 2 1 ... 0 0 0

Each position in the vector corresponds to an allowed word. The zeroes repre- 
sent all allowed words not actually appearing in the text. The non-zero numerals indi- 
cate the number of times the corresponding word appears, e.g., a 1 for the words 
"good" and "remembr," and a 2 for the word "agent"; thus, the numbers indicate the 
document "weight" for the word in question. 

Step 4: Normalize the vector 
Document vectors are normalized (i.e., divided by the magnitude of the vector). The
vector magnitude is given by the square root of the sum of the squared weights. (In
fact, the normalization step takes place in the context of other computations, as described
more fully below.) Normalization facilitates meaningful comparison between the
words in a query and the words in a document in terms of their relative importance; for
example, a word mentioned a few times in a short document carries greater significance
than the same word mentioned a few more times in a very long document.
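
The four steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the RA's actual code: the stop list is a tiny hypothetical sample, and `crude_stem` is a deliberately simplistic suffix stripper standing in for the Porter algorithm (it reproduces the stems of the example sentence but is not a Porter implementation). All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop list; a real list would be much longer.
STOP_WORDS = {"these", "are", "the", "a", "an", "of", "and", "is"}

def crude_stem(word):
    """Simplistic stand-in for the Porter stemmer (an assumption, not Porter)."""
    for suffix in ("ance", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def make_docvec(text):
    """Steps 1-3: drop stop words, stem, and count occurrences per stem."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(w) for w in words if w not in STOP_WORDS]
    return Counter(stems)

def normalize(vec):
    """Step 4: divide each weight by the vector magnitude."""
    magnitude = math.sqrt(sum(w * w for w in vec.values()))
    if magnitude == 0:
        return {}
    return {term: w / magnitude for term, w in vec.items()}

def dot(a, b):
    """Similarity of two normalized vectors stored as sparse dicts."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())

docvec = make_docvec("These remembrance agents are good agents.")
print(dict(docvec))  # counts: remembr 1, agent 2, good 1
```

Storing only the non-zero entries in a dictionary is a design choice of this sketch; it is equivalent to the long vector of mostly zeroes described above.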

In a more recent implementation of the RA, a fifth step is added to improve the
quality of matching beyond that attainable based solely on term frequency. In this fifth
step, vectors are weighted by the inverse of the document frequency of the term, based
on the assumption that words occurring frequently in a document should carry more
weight than words occurring frequently in the entire indexed corpus (which are less
distinguishing). More rigorously, the similarity between two word vectors is found by
multiplying the document term weight (DTW) for each term by the query term weight
(QTW) for that term, and summing these products:

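$$\text{relevance} = \sum \text{DTW} \cdot \text{QTW}$$

where

$$\text{DTW} = \frac{tf \cdot \log\frac{N}{n}}{\sqrt{\sum_{i=1}^{r} \left( tf_i \cdot \log\frac{N}{n_i} \right)^2}}$$

and

$$\text{QTW} = \frac{\left( 0.5 + \frac{0.5\, tf}{\max tf} \right) \cdot \log\frac{N}{n}}{\sqrt{\sum_{i=1}^{r} \left( 0.5 + \frac{0.5\, tf_i}{\max tf} \right)^2 \left( \log\frac{N}{n_i} \right)^2}}$$
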

The document term weight is computed on a document-by-document basis for
each indexed word in the document vector. Because it does not change until new
documents are added to the corpus, these computations may take place only when the
corpus is indexed and re-indexed. The summation in the denominator covers all words
in the document vector (i.e., all indexed words) that also appear in the current document
for which DTW is computed (since a summation term is zero otherwise); this facilitates
normalization. The term frequency (tf) refers to the number of times a particular term
appears in the current document; N is the total number of documents in the corpus; and n
is the number of documents in which the term appears. The summation is taken over
each indexed word (the first through the rth) in the document. The DTW of a term
within a document, then, reflects the number of times it appears within the document
reduced in proportion to its frequency of appearance throughout all documents.

The QTW is computed for each word (the first through the rth) in the query 
vector. In this case, tf refers to the number of times the word appears in the query vec- 
tor, and max tf refers to the largest term frequency for the query vector. If the docu- 
ment term weight is greater than the query term weight, then the former is lowered to 
match the query term weight (in order to prevent short documents from being favored). 
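
As a rough illustration of the weighting just described, the sketch below computes document and query term weights from the quantities defined in the text (tf, N, n and max tf) and then sums their products. It assumes that each weight is normalized by the square root of the sum of squared weights over the vector, and it applies the lowering of a document weight to the query weight as a simple minimum. The function names and the toy corpus statistics are hypothetical.

```python
import math

def doc_term_weights(tf, df, n_docs):
    """DTW per term: tf * log(N/n), divided by the document vector magnitude."""
    raw = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}

def query_term_weights(tf, df, n_docs):
    """QTW per term: (0.5 + 0.5 * tf / max tf) * log(N/n), normalized."""
    max_tf = max(tf.values())
    # Query words absent from the corpus cannot match anything; skip them.
    raw = {t: (0.5 + 0.5 * f / max_tf) * math.log(n_docs / df[t])
           for t, f in tf.items() if t in df}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}

def relevance(dtw, qtw):
    """Sum of DTW * QTW, with DTW lowered to QTW whenever it is larger."""
    return sum(min(dtw.get(t, 0.0), q) * q for t, q in qtw.items())

# Toy corpus statistics (assumed): 4 documents; df maps term -> documents containing it.
df = {"agent": 1, "good": 2, "the": 4}
dtw = doc_term_weights({"agent": 2, "good": 1, "the": 5}, df, 4)
qtw = query_term_weights({"agent": 1}, df, 4)
print(round(relevance(dtw, qtw), 3))
```

Note that a term appearing in every document ("the" in the toy data) receives a weight of zero, because log(N/n) is zero when n equals N.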

The RA, running within Emacs, takes a sample of text every few seconds from 
the current document being edited. This text sample is converted into a vector (called a
"query vector") by the four-step process set forth above. After computing the query
vector, the RA computes the dot product of the query vector with every indexed document.
This dot product represents the "relevance" of the indexed document to the current
sample text, relevance being measured in terms of word matches. One-line summaries
of the top few most relevant documents are listed in the suggestions list appearing at
the bottom of the Emacs window (the exact number displayed is customizable by the
user).

Documents to which sampled text is compared need not be entire files. Instead, 
for example, files can be divided into several "virtual documents" as specified in a tem- 
plate file. Thus, an e-mail archive might be organized into multiple virtual documents, 
each corresponding to a piece of e-mail in the archive. Alternatively, one can index a 
file into multiple "windows" each corresponding to a portion of the file, such that, for 
example, each virtual document is only 50 or so lines long, with each window overlapping
its neighbors by 25 lines. (More specifically, in this representation, window one
includes lines 0-50 of the original document, window two includes lines 25-75, etc.)
This format makes it possible to suggest only sections of a long document, and to jump 
to that particular section when the entirety of the document is brought up for viewing. 
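
The windowing scheme can be sketched as follows. This is an illustrative assumption about how such a split might be coded, not the RA's implementation; the function name and the use of half-open line ranges (window one holding lines 0 through 49) are choices made for the sketch.

```python
def split_into_windows(lines, size=50, overlap=25):
    """Return (start_line, window_lines) pairs; neighbors share `overlap` lines."""
    if not lines:
        return []
    step = size - overlap
    windows = []
    # Stop once the remaining tail is already covered by the previous window.
    for start in range(0, max(len(lines) - overlap, 1), step):
        windows.append((start, lines[start:start + size]))
    return windows

lines = [f"line {i}" for i in range(100)]
for start, window in split_into_windows(lines):
    print(start, len(window))
```

Keeping the starting line with each window is what lets a viewer jump to the matching section once the full document is opened.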

Experience with the RA has shown that actually performing a dot product with 
each indexed document is prohibitively slow for large databases. In preferred implementations,
therefore, document vectors are not stored; instead, word vectors are stored.
The "wordvec" file contains each word appearing in the entire indexed corpus of docu- 
ments, followed by a list of each document that contains that particular word. The 
documents are represented by an integer value (generally 4 bytes) encoding both the 
document number and the number of times that word appears in that particular docu- 
ment. The wordvec file format is as follows: 

(int)       (width*uns int)  (int)          (uns int)  (uns int)        (uns int)
NUM_WORDS,  WORDCODE-1,      NUM_DOCS=N1,   DOC-1,     DOC-2,    ...,   DOC-N1,
            WORDCODE-2,      NUM_DOCS=N2,   DOC-1,     DOC-2,    ...,   DOC-N2,
etc.

The headings indicate the type of data each variable represents (integer, unsigned
integer). The first entry in the wordvec file, NUM_WORDS, is the number of words
appearing in the entire file. Each word in the wordvec is represented by a unique numerical
code, the "width" indicating the number of integers in the code (the RA uses
two integers per code). The NUM_DOCS field indicates the number of documents
containing the word specified by the associated wordcode. The word-count variables
DOC-1, DOC-2, ..., DOC-N1 each correspond to a document containing the word, and
reflect the number of occurrences of the word divided by the total number of words in
the document.

A word offset file contains the file offsets for each word in the wordvec file, and 
is used to overcome the difficulties that would attend attempting to locate a particular 
wordcode in the wordvec file. Because each wordcode in the wordvec file can be asso- 
ciated with an arbitrary number of documents, locating a particular wordcode would re- 
quire searching wordcode by wordcode, jumping between wordcodes separated by the 
arbitrary numbers of intervening word-count variables. To avoid this, a "wordvec off- 
set" file is used to specify the location of each wordcode in the wordvec file. 



(width*uns int)  (long)
WORDCODE-1,      OFFSET-1,
WORDCODE-2,      OFFSET-2,
etc.

Since each entry has a fixed length, it is possible to perform rapid binary searches on the 
wordvec offset file to locate any desired wordcode. 
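
A binary search over fixed-length records might look like the sketch below. The record layout (two 4-byte unsigned integers for the wordcode followed by an 8-byte offset, little-endian) is an assumption chosen for illustration; the text specifies only that the code is two integers wide and the offset is a long. The function names are hypothetical, and the table is held in memory rather than read from disk.

```python
import struct

# Assumed record layout: two 4-byte code integers + one 8-byte file offset.
RECORD = struct.Struct("<2Iq")

def build_offset_table(entries):
    """entries: iterable of ((code_hi, code_lo), offset); stored sorted by wordcode."""
    return b"".join(RECORD.pack(hi, lo, off) for (hi, lo), off in sorted(entries))

def find_offset(table, wordcode):
    """Binary-search the fixed-length records; return the wordvec offset or None."""
    lo, hi = 0, len(table) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        c_hi, c_lo, offset = RECORD.unpack_from(table, mid * RECORD.size)
        if (c_hi, c_lo) == wordcode:
            return offset
        if (c_hi, c_lo) < wordcode:
            lo = mid + 1
        else:
            hi = mid
    return None

table = build_offset_table([((0, 7), 120), ((0, 3), 16), ((1, 2), 400)])
print(find_offset(table, (0, 7)))
```

Because every record has the same size, the position of the record at index `mid` is simply `mid * RECORD.size`, which is what makes the search possible without scanning.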

Accordingly, for each word in the query vector, the RA first looks up the word
in the word offset file, and from that the word's entry is looked up in the wordvec file.
An array of document similarities is used to maintain a running tally of documents and
their similarities, in terms of numbers of word matches, to the query vector. The array is
sorted by similarity, with the most similar documents at the top of the list. Similarity is
computed for each word in the query vector by taking the product of the query-vector
entry and the weight of each document in the corresponding wordvec file. To normalize
this product, it is then divided by the query-vector magnitude (computed in the same
manner as the document magnitude) and also by the document magnitude. The final
value is added to the current running-total similarity for that document, and the process
repeated for the next word in the query. In summary, the query vector is analyzed
wordcode by wordcode, with the similarities array indicating the relevance to the query
of each document.

When computing the similarity of a query to an indexed document, it is preferred
to employ a "chopping" approach that prevents an indexed word in a document from
having a higher weight than the word has in the query vector. If the weight of the word
in the indexed document is higher than its weight in the query vector, the document
weight gets "chopped" back to the query's value. This approach ensures that, for
example, a query containing the word "spam" as just a single unimportant
word will not get overwhelmingly matched to one-word documents (which have the
highest possible weight) or documents like "spam spam spam spam eggs bacon spam..."
The word-vector storage method is slower on indexing and its index files take more space,
but it is much faster on retrieval because only documents containing words in the query
are even examined.
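
The retrieval loop with chopping can be sketched as follows, assuming the document and query weights have already been normalized. The in-memory dictionary stands in for the wordvec file, and all names and the toy weights are hypothetical.

```python
from collections import defaultdict

def build_wordvec(docs):
    """docs: {doc_id: {word: weight}} -> {word: [(doc_id, weight), ...]}."""
    wordvec = defaultdict(list)
    for doc_id, weights in docs.items():
        for word, weight in weights.items():
            wordvec[word].append((doc_id, weight))
    return wordvec

def rank(query, wordvec, top_n=3):
    """Accumulate per-document similarity one query word at a time."""
    scores = defaultdict(float)
    for word, q_weight in query.items():
        for doc_id, d_weight in wordvec.get(word, ()):
            # "Chopping": a document weight may not exceed the query weight.
            scores[doc_id] += q_weight * min(d_weight, q_weight)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

docs = {
    "spam_only": {"spam": 1.0},
    "recipe": {"spam": 0.3, "eggs": 0.6, "bacon": 0.74},
    "unrelated": {"agent": 1.0},
}
query = {"eggs": 0.7, "bacon": 0.7, "spam": 0.14}
results = rank(query, build_wordvec(docs))
print(results)
```

Only documents listed under one of the query's words are ever touched, so "unrelated" is never scored, and the one-word "spam_only" document cannot outrank the recipe on the strength of a minor query word.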

The other files created on indexing are a location file (doc_locs) containing a
mapping between document number and filename for that document, a titles file containing
the information for the one-line summary (titles), offset files for doc_locs and titles
(dl_offs and t_offs) to do quick lookups, and a window-offset file specifying where to
jump in a file for a particular portion of a windowed document.

While the RA offers substantial capabilities for automated, "observational" retrieval,
the cues it utilizes to identify possibly relevant documents are limited to word
similarities. This is adequate for many computational tasks and clearly suits the traditional
desktop environment of everyday computing: if the user is engaged in a word-related
computational task, word-based cues represent a natural basis for relevance determinations.
In other words, the current information reliably indicates the relevance of
similar information. More broadly, however, human memory does not operate in a vacuum
of query-response pairs. Instead, the context as well as the content of a remembered
episode or task frequently embodies information bearing on its relevance to later
experience; the context may include, for example, the physical location of an event, who
was there, what was happening at the same time, and what happened immediately before
and after.

As computer components grow smaller and less expensive, so-called "wearable" 
computers that accompany the user at all times become more feasible. Users will perform
an ever-increasing range of computational tasks away from the desktop and in the
changing environmental context of everyday life. Consequently, that changing context
will become more relevant for recall purposes. Even now, inexpensive laptop computers
allow users to monitor their physical locations via global-positioning systems ("GPSs")
or infrared ("IR") beacons, and to access various kinds of environmental sensors or
electronic identification badges. Since information is created in a particular context, the 
attributes of that context may prove as significant as the information itself in determining 
relevance to a future context. 

Contextual "meta-information" is not limited to physical surroundings. Even in
traditional desktop environments, where for practical purposes the physical context remains
constant, meta-information such as the date, the time of day, the day of the week,
or the general subject can provide cues bearing on the relevance of information
(regardless of, or more typically in addition to, the content of the information itself).
Word-based searching and retrieval systems such as the RA are incapable of capturing
these meta-informational cues.



SUMMARY OF THE INVENTION 
The present invention improves on the RA by extending its comparative capa- 
bilities beyond word similarities. The invention monitors various aspects of the user's 
computational and/or physical environment, and utilizes these as bases for relevance 
suggestions. Accordingly, unlike the RA, the "context" for assessing relevance is not 
what the user is typing, but can instead be any kind of information about the user's cur- 
rent situation. Examples include the user's current location (room or GPS location), the 
time of day, the day of the week, the date, the subject being discussed, person being 
talked to, etc. In this way, the invention can remind a user of personal information rele- 
vant to the current environment, or use environmental cues in effect as search vectors in 
a broader search that extends beyond these cues. As a result, the invention can be im- 
plemented not only in the traditional computing environment in which RA operates, but 
also in fundamentally different environments. For example, the invention may be imple- 
mented as a wearable or portable memory aid. 

In the RA, only the words within an indexed document are used to determine
relevance. In accordance with the present invention, by contrast, these documents may
be associated with a wide range of meta-information (i.e., information about the information),
and it is this meta-information that is used to determine relevance, either alone
or, if desired, in combination with the lexical comparisons implemented by RA.
Meta-information about a document can be entered explicitly by a user, can be tagged
automatically when the information is first created, or can be generated by analyzing the
structure of the document. For example, if a student were writing notes in class, she
could explicitly write the current date in a special header at the top of the notes; the date
would then function as meta-information searchable by the present invention. Alternatively,
the notes might automatically be tagged with the date based on the system clock.
Finally, if the notes were instead e-mail, the e-mail would already bear a timestamp, so a
system configured to recognize the structure of an e-mail document could glean the date
the mail was sent from the existing e-mail header without modification or special handling.

Meta-information useful in the context of the present invention can include far
more than just date, of course, and the invention works best if several different kinds of 
meta-information are available. Some examples illustrate various forms of meta- 
information and their relevance to the capabilities of the present invention: 

Scenario #1 : A student takes notes in class, and as her notes are saved as files, 
they are automatically tagged with several pieces of meta-information, including the 
room in which the notes were taken, the date, the day of the week, and the time of day. 
As she enters the classroom a week later, an infrared (IR) beacon broadcasts the room
number to her wearable computer. The time of day and day of the week are also available
to the computer from its system clock, and the invention automatically brings up the
previous week's class notes as a "relevant" document on her computer.

Scenario #2: A salesman is at a trade show and meets a potential client at the
booth. He does not recognize the client, but the trade show has supplied everyone with
name badges that also broadcast the person's name via an IR carrier beacon. The
salesman's wearable computer receives that information from the potential client, and
matches the person's name to a note file written two years ago at a previous trade show.
The notes concerned a previous meeting in which the potential client had listed his needs
and business plans for the future; since at that previous meeting there were no active
badges, the salesman had explicitly tagged the note with the person's name by typing it
in. Because of the name match, the invention now displays the relevant information on
his eyeglass-mounted computer display, allowing him to make a more focused sales
pitch.

Scenario #3 : A tourist is visiting Seattle for the first time, and his car is 
equipped with a dashboard computer running the invention. As he drives around, the 
invention brings up notes from his wife's previous trip to Seattle, based on the location
supplied from the car's global-positioning system and the time of day. He winds up going
to a restaurant in which his wife had entered the note "Great place for lunch - try the
fish."

Scenario #4: A businesswoman has indexed her day-planner files, and the invention
has gleaned the dates, times, and locations for appointments from the structured
files. The invention reminds her of the time for her dentist appointment as this draws
near. When she drives by the grocery store on the way back from the dentist, however,
her location (supplied by GPS) triggers the invention to automatically remind her of her
calendar entry "get birthday cake [Quality Foods]." In this case, the calendar entry was
tagged both with a date and a machine-readable location.

Accordingly, in a first aspect, the invention provides an apparatus for context- 
based document identification. The apparatus, ordinarily implemented on a program- 
mable computer, includes a database for indexing a plurality of documents in terms of 
meta-information that specifies contextual information about the document or its con- 
tents; means for acquiring current contextual information (e.g., regarding a user's physi- 
cal or computational environment); means for searching the database to locate docu- 
ments whose meta-information is relevant to the current contextual information; and 
means for reporting the identified documents to the user. 

In a second aspect, the invention comprises a method of identifying documents 
from a stored document database in response to contextual information. The method
comprises indexing each stored document in terms of meta-information specifying con- 
textual information about the document; acquiring current contextual information; 
identifying stored documents whose meta-information comprises information relevant to 
the current contextual information; and reporting the located documents to the user. 

BRIEF DESCRIPTION OF THE DRAWING

The detailed description below refers to the accompanying drawing, which illus- 
trates a representative hardware platform for practicing the invention. 

DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT
While the present invention can utilize any kind of meta-information about a
document, certain kinds of sensors are particularly preferred. It should be stressed,
however, that meta-information need not be provided by physical or computational sensors,
but instead may be entered explicitly by the user, or through analysis of the structure
of the document currently being indexed (or viewed by the user). Furthermore, the
term "document" need not connote or be limited to a traditional text file; instead, the
context-based approach of the present invention can be used to identify materials such as
images, video or audio (either explicitly or automatically recorded) not easily searchable
by conventional means.

A representative list of meta-information items useful in accordance with the
present invention, along with potential sources and the meaning attributable to the
meta-information, is as follows:


Meta-information: LOCATION
Supplied by:      IR beacon, GPS, human-entered, included in document
Meaning:          Place where note was taken, information
                  regarding this place

Meta-information: PERSON
Supplied by:      IR-transmitting name tag, human-entered,
                  video face-recognition, biometrics (voice-print),
                  included in document (e.g., "from" field in e-mail)
Meaning:          Person/people who were there when note was taken,
                  information regarding this person; person/people otherwise
                  associated with document (e.g., author)

Meta-information: DATE (e.g., date and timestamp)
Supplied by:      System clock, included in document (e.g., calendar
                  entry), entered by person
Meaning:          Date/time when note was taken, information
                  regarding this time

Meta-information: TIME-OF-DAY
Supplied by:      System clock, included in document (e.g., calendar
                  entry), entered by person
Meaning:          Time of day when note was taken, information
                  regarding this time

Meta-information: DAY-OF-WEEK
Supplied by:      System clock, included in document (e.g., calendar
                  entry), entered by person
Meaning:          Day of week when note was taken, information
                  regarding this day of the week

Meta-information: SUBJECT
Supplied by:      Included in document (e.g., subject line in e-mail,
                  or key words in technical paper), entered by person,
                  speech-recognition software doing word spotting
Meaning:          This is the subject of a piece of information, or
                  the subject of a current conversation or discussion

Meta-information: DOMAIN-SPECIFIC INFO / OTHER
Supplied by:      Variable
Meaning:          Many specific applications have special kinds of
                  information that may be used as meta-information.
                  For example, a shopper might be
                  interested in product names. When the shopper
                  passes an item on the shelf at the supermarket,
                  the invention might remind him that that product is on
                  his shopping list.

Meta-information: Text being edited, text on screen
Supplied by:      Current computational situation (on screen)
Meaning:          This information is not really meta-information, but
                  is rather the information itself, vectorized in the
                  same way the RA vectorizes information.
                  This information may be treated the same as any other
                  sensory information.

Many kinds of meta-information, such as people's names, can be represented as 
strings of words. Such information is considered "discrete," in the sense that a person 
either is or is not Jane Doe, and if she is not Jane Doe she cannot be "a close second." 
Other kinds of information are "continuous," such as times or latitude and longitude. 
With continuous information, values can be near each other, far from each other, or 
anywhere in between. 
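
The distinction between discrete and continuous values can be illustrated with two small comparison functions. The passage does not specify how either kind of value is scored, so everything here is an assumption made for illustration: the function names, the case-insensitive string comparison, and the exponential falloff with a caller-supplied `scale`.

```python
import math

def discrete_match(a, b):
    """Discrete values either match or do not; there is no 'close second'."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0

def continuous_match(a, b, scale):
    """Continuous values score higher the nearer they are (assumed falloff)."""
    return math.exp(-abs(a - b) / scale)

print(discrete_match("Jane Doe", "jane doe"))
# Hours of the day: 9:30 is nearer to 9:00 than 14:00 is.
print(continuous_match(9.0, 9.5, scale=1.0) > continuous_match(9.0, 14.0, scale=1.0))
```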

Refer now to FIG. 1, which illustrates, in block-diagram form, a hardware plat- 
form incorporating a representative, generalized embodiment of the invention. As indi- 
cated therein, the system includes a central-processing unit ("CPU") 100, which per- 
forms operations on and interacts with a main system memory 103 and components 
thereof. System memory 103 typically includes volatile or random-access memory
("RAM") for temporary storage of information, including buffers, executing programs,
and portions of the computer's basic operating system. The platform typically also in- 
cludes read-only memory ("ROM") for permanent storage of the computer's configura- 
tion and additional portions of the basic operating system, and at least one mass storage 
device 106, such as a hard disk and/or CD-ROM drive. All components of the platform 
are interconnected by, and communicate over, a bidirectional system bus 110. 

The platform preferably also includes one or more sensors 113 for gathering en- 
vironmental or physical information. Sensors 113 may include, for example, means for 
ascertaining the current geographical location of the platform (by means of a GPS cir- 
cuit), local environmental or positional information (by means of an IR beacon, radio 
broadcast receiver, or transducer circuit), or the identities of individuals in the immediate 
area (by means of a receiver for IR-transmitting name tags, video face-recognition cir- 
cuitry, or biometric-analysis circuitry (e.g., voice-print)). The operation of sensors 113 
is regulated by a meta-information interface ("MII") 116, which also provides access to
the information obtained by currently active sensors. In addition, MII 116 provides access
to meta-information originated by the operating system, e.g., the date, the time of
day, and the day of the week. 

The user interacts with the platform by means of, for example, a keyboard 120
and/or a position-sensing device (such as a mouse) 123, and an output device 126 (a
conventional display or, for example, audio output from headphones or an automobile
dashboard speaker). It should be stressed, however, that these components need not
take the form normally associated with desktop or laptop computers, since the invention
is amenable to use with wearable data-processing equipment. Thus, display 126 may be
an eyepiece projecting a virtual image instead of a CRT or flat-panel screen, and
position-sensing device 123 may be a small hand-held sensor pad or a wireless mouse.
Furthermore, the components of the platform need not reside within a single package.
Again, given the varying architectures of actual and proposed wearable computing systems,
different components may reside in different physical locations, linking to one another
by wired network circuits or bodyborne electrical signals.

The main memory 103 contains a group of modules that control the operation of
CPU 100 and its interaction with the other hardware components. These modules are
implemented as executable machine instructions, running (by means of CPU 100) as
active processes effectively capable of interacting (i.e., exchanging data and control
commands) as illustrated. An operating system 130 directs the execution of low-level, basic
system functions such as memory allocation, file management, and operation of mass 
storage devices 106. At a higher level, an analyzer module 133 directs execution of the 
primary functions performed by the invention, as discussed below; and instructions defin- 
ing a user interface 136 allow straightforward interaction over display 126. User inter- 
face 136 generates words or graphical images on display 126 to facilitate user action and 
examination of documents, and accepts user commands from keyboard 120 and/or posi- 
tion-sensing device 123. 

The current document (i.e., the file with which the user is interacting) resides in a 
document buffer 140. A document index buffer 143 contains a vectorized index of all 
candidate documents, which are themselves stored on mass storage device 106.
Meta-information obtained from MII 116 is stored in a memory partition 146, and is used by
analysis module 133 in performing searches of the index buffer 143 in accordance with
the invention. Memory 103 may also contain a series of buffers 150 for storing document
templates, which are used by analysis module 133 to index and, possibly, to locate
meta-information as discussed below. Again, the system may contain large numbers of
templates stored on device 106, only some of which reside at any one time in buffers 150.

It must be understood that although the modules of main memory 103 have been 
described separately, this is for clarity of presentation only; so long as the system per- 
forms all necessary functions, it is immaterial how they are distributed within the system 
and the programming or hardware architecture thereof. Furthermore, while arrows
indicate the operative relationships among the memory modules, and dashed arrows the
relationships between memory modules and hardware components, the actual basis for
communication is provided by operating system 130 and CPU 100. 

Analysis module 133 first indexes all the documents in a corpus of data (which,
again, are stored as files on mass storage device 106, which is assumed for explanatory
purposes to be a hard disk), and writes the indices to disk. Unlike the RA, the invention
preferably keeps several vectors for each document. These include not only the
wordvec vector for text (if any) in the document, but also vectors for meta-information, e.g.,
subject, people, time, date, day of week, location, etc.

Analysis module 133 indexes a document as follows: 

1. Identification: Documents are broken into types based on a template file
specific to the particular type of document. For example, an e-mail file includes the
following recognition criteria:

Template plain_email
{
    Recognize
    {anyorder {startline, "From: "}
              {startline, "Date: "}}

Because the invention is intended to search different kinds of documents,
recognition criteria are used to indicate the manner in which particular kinds of files are
organized to facilitate the search. For example, recognition criteria can set forth file
organization, indicating the manner in which files are separated from one another. The
recognition criteria can also indicate where specific pieces of meta-information (e.g., the 
date, subject, recipient, etc.) are located, as well as the location of document text. Rec- 
ognition criteria for particular types of documents are contained in templates. The in- 
vention checks for the presence of a template matching a particular document type, and 
if one is found, the recognition criteria therein are employed. If no template is matched, 
the document is considered to be raw text. (If more than one template is matched, the 
first is used.) 
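
The identification step may be sketched, purely for illustration, as follows. The in-memory template list, the function names and the "raw_text" label are hypothetical; the sketch assumes only what is stated above, namely that a template is matched when each of its "startline" strings begins some line of the document (in any order), that the first matching template wins, and that an unmatched document is treated as raw text.

```python
# Hypothetical in-memory form of the templates: (name, required line prefixes),
# listed in priority order so that the first match is the one used.
TEMPLATES = [
    ("plain_email", ["From: ", "Date: "]),
]


def matches_anyorder_startlines(text, prefixes):
    """Return True if every prefix begins at least one line, in any order."""
    lines = text.splitlines()
    return all(any(line.startswith(p) for line in lines) for p in prefixes)


def identify(text, templates=TEMPLATES):
    """Return the name of the first matching template, else "raw_text"."""
    for name, prefixes in templates:
        if matches_anyorder_startlines(text, prefixes):
            return name
    return "raw_text"
```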

2. After the document is identified, different fields are extracted, again based on 
the template. For example, the e-mail template continues: 

    Delimiter
    {startline, "From "}
    Format
    {{anyorder {startline, "From: ", PERSON, "\n"}
               {startline, "Date: ", DATE, "\n"}
               optional {startline, "Subject: ", SUBJECT, "\n"}}
     "\n\n", BODY}
}

Bias 2 1 1 0 0 0 0 0
The delimiter command explicitly identifies the separator between one document
of this template type and another, should they both reside in the same file. (For
example, a plain e-mail archive may contain several pieces of mail in the same file, all
separated by the word "From" plus a space at the start of a line.) The remainder of the
template specifies that the "From:" line contains the person or people associated with this
document, and the line starting with "Date:" contains the date/timestamp of the
document.
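
A minimal sketch of the delimiter and field-extraction behavior just described is given below. The function names and the dictionary keys are hypothetical; the sketch assumes that the header is separated from the body by the first blank line, as the "\n\n" in the Format entry suggests, and that the Subject line is optional.

```python
import re


def split_documents(archive_text):
    """Split an archive on lines beginning with "From " (the Delimiter)."""
    # "From: " header lines are not delimiters: the fifth character must be a space.
    parts = re.split(r"(?m)^From .*\n", archive_text)
    return [p for p in parts if p.strip()]


def extract_fields(message):
    """Extract PERSON, DATE, optional SUBJECT and BODY from one message."""
    header, _, body = message.partition("\n\n")
    fields = {"person": None, "date": None, "subject": None, "body": body}
    prefixes = {"From: ": "person", "Date: ": "date", "Subject: ": "subject"}
    for line in header.splitlines():
        for prefix, key in prefixes.items():
            if line.startswith(prefix):
                fields[key] = line[len(prefix):]
    return fields
```

Each extracted field would then be passed to the vectorization step for its own type of meta-information.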

Templates can also be employed during document creation, modification or
storage to guide the acquisition of meta-information from MII 116 and its association with
the document (the template typically being invoked in this situation by the application
program responsible for creating the document). That is, a template field may point not
to data within the document, but to meta-information acquired by MII 116 and/or
sensors 113 that is to be stored with the document as part of its header. Suppose, for
example, that the template for a document specifies a meta-information field indicating the
geographic location where the document was created. Depending on the configuration
of the system, that information may be continually acquired by a GPS sensor 113 and
periodically placed in meta-information memory partition 146 by MII 116.
Alternatively, GPS sensor 113 may be only intermittently active in response to a command
issued by MII 116. In this case, the template instructs analysis module 133 to request a
geographic location from MII 116, which in response activates the GPS sensor and
loads the acquired value into partition 146. Numerous variations on these approaches
are of course possible, combining varying frequencies of sensor activity, data acquisition
and storage, as well as user action. In this way, meta-information may be automatically
associated with a document at appropriate junctures without distracting the user.

For indexing purposes, the template structure of the present invention may be 
similar to the templates used by the RA, but with a different interpretation. With the 
RA, the date and person information was only used to create the one-line summary. In
accordance with the present invention, each type of meta-information is placed in its own
vector, and a single vector represents each type of meta-information supported by the 
invention. 






The final entry in the template file is the bias number for the particular type of 
file, which ranks the fields of the file in terms of importance. In the e-mail example 
above, the bias means that the body of the e-mail is most important, person and date 
fields are secondary (in a ratio 2 to 1 to 1), and no other fields are used to compute 
similarity. 

3. Vectorization 

Once information is parsed out of the document, it is encoded and vectorized. 
The encoding is as follows. The invention uses three integers to encode words (as
compared with the two-integer wordcodes of the RA). Consequently, each character is 6
bits, with the most significant 6 bits in the first integer being the type field. Bits wrap to 
the next byte, as follows: 

tttttt 111111 222222 333333 444444 55     = 32 bits
5555 666666 777777 888888 999999 0000     = 32 bits
00 111111 222222 333333 444444 555555     = 32 bits

                                          = 15 characters, 6 bits type



Code    Type
0x0     Body
0x1     Text Location (discrete)
0x2     Subject
0x3     Person
0x4     Date
0x5     Time
0x6     Day
0x7     GPS Location (continuous)



Characters are packed into a 6-bit packed representation: 

a-z = 01-1A
0-9 = 1B-24
_   = 25
-   = 26
!   = 27

Anything else gets mapped to ascii(c) & 0x3F (lower 6 bits) 
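
The packing scheme may be sketched as follows, for illustration only. The function names are hypothetical, and several details are assumptions not stated above: that a word is truncated to 15 characters, that unused character positions are filled with zero, and that bits are packed most-significant first across the three 32-bit integers.

```python
def char_code(c):
    """Map one character to its 6-bit code per the table above."""
    if "a" <= c <= "z":
        return ord(c) - ord("a") + 0x01   # a-z -> 0x01-0x1A
    if "0" <= c <= "9":
        return ord(c) - ord("0") + 0x1B   # 0-9 -> 0x1B-0x24
    if c == "_":
        return 0x25
    if c == "-":
        return 0x26
    if c == "!":
        return 0x27
    return ord(c) & 0x3F                  # anything else: lower 6 bits


def encode_word(word, type_code):
    """Pack 6 type bits plus 15 six-bit characters into three 32-bit integers."""
    bits = type_code & 0x3F
    for i in range(15):
        # Assumption: short words are zero-padded, long words truncated.
        code = char_code(word[i]) if i < len(word) else 0
        bits = (bits << 6) | code
    mask = 0xFFFFFFFF
    return ((bits >> 64) & mask, (bits >> 32) & mask, bits & mask)
```

The type code can be recovered from the first integer by shifting it right by 26 bits, which is what allows vectors of different types to share one file.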

Day of week is simply encoded as a number, 0-7, plus the type bits. Date is
encoded as type bits plus number of seconds since the epoch (January 1, 1970,
12:00:01AM). Time of day is encoded as number of seconds since midnight, plus type
bits. Any meta-information that can be represented by text (e.g., subject lines, room
names, people names, bodies of text, etc.) is encoded in accordance with the above
scheme. Like the body of text, each word in these text strings is encoded separately and
added to a vector. Vectors of discrete (text) data are all stored in one file, but the
vectors are still conceptually distinct and are distinguished by their type bits. The file
format for discrete type information is the same as the wordvec file format. Non-discrete
information is stored in its own separate file, in order to better search for information
that is "close enough" to information in a query vector.

4. Determination of relevance 

For each element of each discrete vector in a query (the generation and
vectorization of which is described below), the algorithm used by the RA may be used to
determine relevance to documents in the corpus. For "continuous" vectors (e.g., date,
GPS location, etc.), the algorithm is modified to permit degrees of similarity, producing
a value between 0.0 and 1.0. For example, for each date in the query's date vector, a
binary search on the date_file is performed to find any dates within 24 hours of the query
date. These are given a distance value based on how close (that is, how temporally
proximate) the values are to the query date. These distance values are converted to
weighted similarity, and are added to the similarity of the date vectors in the same way
as in the discrete case.
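
The treatment of a continuous date vector may be sketched as follows. The function name and the in-memory list standing in for the date file are hypothetical. The linear conversion from distance to similarity is an assumption made for the sake of the example; the description above requires only that closer dates yield higher values between 0.0 and 1.0.

```python
from bisect import bisect_left, bisect_right

WINDOW_SECONDS = 24 * 60 * 60  # "within 24 hours of the query date"


def date_similarities(query_ts, sorted_index):
    """Score documents whose date lies within the window of query_ts.

    sorted_index: list of (timestamp, doc_id) pairs sorted by timestamp,
    a hypothetical in-memory stand-in for the date file.
    """
    timestamps = [ts for ts, _ in sorted_index]
    # Binary search for the slice of entries inside the window.
    lo = bisect_left(timestamps, query_ts - WINDOW_SECONDS)
    hi = bisect_right(timestamps, query_ts + WINDOW_SECONDS)
    scores = {}
    for ts, doc_id in sorted_index[lo:hi]:
        # Assumed conversion: similarity falls linearly from 1.0 to 0.0.
        similarity = 1.0 - abs(ts - query_ts) / WINDOW_SECONDS
        scores[doc_id] = max(scores.get(doc_id, 0.0), similarity)
    return scores
```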

5. Weighted addition of vectors 

The result of the foregoing operations is a single similarity value for each type of 
meta-information. These values are associated with each document in the indexed cor- 
pus, and are used to compute the overall similarity using bias values for query and 
document types, by the following formula: 

Query biases = bq pq sq lq dq etc. (i.e., body_query_bias, person_query_bias, etc.)

Index bias = bi pi si li di etc. (i.e., biases for this indexed document, gleaned from
the template file)

Non-normalized biases = bq*bi pq*pi sq*si lq*li dq*di etc.

Normalized biases = bq*bi/M pq*pi/M sq*si/M lq*li/M dq*di/M etc.
where M = magnitude = (bq*bi + pq*pi + sq*si + lq*li + dq*di)
Each vector similarity is multiplied by its respective bias, and the resulting biased
similarities are summed, to produce an overall similarity between zero and one.
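
The weighted addition may be illustrated by the following sketch. The function name and the use of dictionaries keyed by field name are hypothetical; the arithmetic follows the formula above, with the added assumption that a zero magnitude yields an overall similarity of zero.

```python
def overall_similarity(similarities, query_bias, index_bias):
    """Combine per-field similarities using normalized bias products.

    All three arguments map a field name (e.g. "body", "person", "date")
    to a number; fields missing from either bias are treated as unused.
    """
    fields = set(query_bias) & set(index_bias)
    products = {f: query_bias[f] * index_bias[f] for f in fields}
    magnitude = sum(products.values())  # M in the formula above
    if magnitude == 0:
        return 0.0  # assumption: no usable fields means no similarity
    return sum(
        similarities.get(f, 0.0) * products[f] / magnitude for f in fields
    )
```

For example, with an index bias of 2 for the body and 1 each for person and date, equal query biases, and per-field similarities of 0.5, 1.0 and 0.0 respectively, the normalized biases are 0.5, 0.25 and 0.25 and the overall similarity is 0.5.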

Analysis module 133 preferably generates search queries autonomously from the 
current document in document buffer 140 or by reference to a current context. In the 
former case, analysis module 133 classifies the document either by its header or by ref- 
erence to a template, and extracts the appropriate meta-information. In the latter case, 
the user's physical or interpersonal surroundings furnish the meta-information upon 
which the query is based. It is not necessary for the documents searched or identified to 
correspond in type to a current document. Furthermore, the query may not be limited to 
meta-information. Instead, the invention may utilize both a meta-information compo- 
nent (with relevance to candidate documents determined as discussed above) and a text 
component (with relevance determined in accordance with the RA). Analysis module
133 may also respond to queries provided by the user directly. For example, the user 
may specify a search in accordance with meta-information not ordinarily associated with 
the current document or the current surroundings, requesting, for example, a particular 
kind of document (or, indeed, any document generated) the last time the user was in or 
near a specified location. 

The query is vectorized as described above in connection with the RA. Analysis 
module 133 supplies a ranked list of the most relevant documents, which may be
continually, intermittently, or upon request presented to the user over display 126. If
desired, or upon user command, the list may be pruned to include only documents whose
relevance level exceeds a predetermined threshold. 
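
The ranking and optional pruning may be sketched as follows. The function name and parameters are hypothetical; the sketch assumes that each candidate document already carries an overall similarity value between zero and one.

```python
def ranked_documents(scores, threshold=None, limit=None):
    """Return (doc_id, score) pairs, most relevant first.

    scores: mapping of document identifier -> overall similarity.
    threshold: if given, keep only documents whose score exceeds it.
    limit: if given, keep at most this many documents.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(doc, s) for doc, s in ranked if s > threshold]
    return ranked if limit is None else ranked[:limit]
```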

It will therefore be seen that the foregoing represents a versatile and highly ro- 
bust approach to document searching and recall. The terms and expressions employed 
herein are used as terms of description and not of limitation, and there is no intention, in 
the use of such terms and expressions, of excluding any equivalents of the features 
shown and described or portions thereof, but it is recognized that various modifications
are possible within the scope of the invention claimed. 

What is claimed is:
CLAIMS

1. Apparatus for context-based document identification, the apparatus comprising: 

a. a database indexing a plurality of documents each in terms of meta- 
information specifying contextual information about the document; 

b. means for acquiring current contextual information; 

c. means for searching the database to identify documents whose meta- 
information comprises information relevant to the current contextual informa- 
tion; and 

d. means for reporting the identified documents to the user. 

2. The apparatus of claim 1 wherein the meta-information comprises at least one of 
(a) a user location, (b) time of day, (c) day of week, (d) date, and (e) subject. 

3. The apparatus of claim 1 wherein the meta-information comprises identification 
of a person associated with the document. 

4. The apparatus of claim 1 wherein the means for acquiring current contextual in- 
formation comprises an environmental sensor. 

5. The apparatus of claim 4 wherein the environmental sensor is a global- 
positioning system. 

6. The apparatus of claim 1 wherein the means for acquiring current contextual in- 
formation comprises means for identifying a nearby individual. 

7. The apparatus of claim 1 further comprising a system clock, the means for ac- 
quiring current contextual information being connected to the system clock and deriving 
contextual information therefrom.
8. The apparatus of claim 1 wherein at least some of the meta-information is
continuous, the searching means identifying relevant information based on proximity of the
meta-information to the current contextual information. 

9. The apparatus of claim 1 wherein at least some of the meta-information is dis- 
crete, the searching means identifying relevant information based on an exact match 
between the meta-information and the current contextual information. 

10. The apparatus of claim 1 further comprising means for adding new documents to 
the database, said means comprising:

a. a plurality of document templates, each template corresponding to a document 
type and specifying contextual information within the document; 

b. analysis means for matching a new document to a template and, in accordance 
with the template, extracting contextual information from the document; and 

c. means for indexing the document within the database in terms of the extracted 
contextual information. 

11. The apparatus of claim 1 wherein the meta-information is represented in the da- 
tabase as vectors, each vector corresponding to a document and to a type of contextual 
information associated therewith and having a value representative of the associated 
contextual information. 

12. The apparatus of claim 11 wherein the current contextual information is
represented as a vector, the searching means determining relevance based on the current
contextual-information vector and the vectors in the database. 

13. The apparatus of claim 1 wherein the means for acquiring current contextual
information comprises user-responsive means for accepting user-provided contextual
information.
14. The apparatus of claim 1 further comprising means for storing a current
document, the means for acquiring current contextual information comprising means for
analyzing the current document for contextual information and extracting the contextual
information therefrom.

15. A method of identifying documents from a stored document database in response 
to contextual information, the method comprising the steps of 

a. indexing each stored document in terms of meta-information specifying con- 
textual information about the document; 

b. acquiring current contextual information; 

c. identifying stored documents whose meta-information comprises information 
relevant to the current contextual information; and 

d. reporting the located documents to the user. 

16. The method of claim 15 wherein the meta-information comprises at least one of 
(a) a user location, (b) time of day, (c) day of week, (d) date, and (e) subject. 

17. The method of claim 15 wherein the meta-information comprises identification of 
a person associated with the document. 

18. The method of claim 15 wherein at least some of the meta-information is con- 
tinuous, relevant information being identified based on proximity of the meta- 
information to the current contextual information. 

19. The method of claim 15 wherein at least some of the meta-information is dis- 
crete, relevant information being identified based on an exact match between the meta- 
information and the current contextual information. 

20. The method of claim 15 further comprising the step of adding new documents to 
the database according to substeps comprising: 
a. defining a plurality of document templates, each template corresponding to a
document type and specifying contextual information within the document; 

b. matching a new document to a template and, in accordance with the template, 
extracting contextual information from the document; and 

c. indexing the document within the database in terms of the extracted contextual 
information. 

21. The method of claim 15 further comprising the step of representing the meta- 
information as vectors, each vector corresponding to a document and to a type of con- 
textual information associated therewith and having a value representative of the asso- 
ciated contextual information. 

22. The method of claim 21 wherein the current contextual information is
represented as a vector, relevance being determined in accordance with the current
contextual-information vector and the vectors in the database.

23. The method of claim 15 wherein the current contextual information is acquired
from a user.

24. The method of claim 15 wherein current contextual information is acquired from
a current document.